Motivation

One of our group members’ had the unfortunate experience of getting her car towed last year, due to parking near by a fire hydrant in Minnesota. A news report published by New York Post about locations drivers were most likely to get a parking ticket in NYC between October 2020 and September 2021. Thus, we were intrigued by the potential of investigating parking violations in NYC in 2021, with a particular interest in fire hydrant parking violations.

 

Questions

  1. When and where are parking violations in NYC most likely to be found?
  2. Which borough had the most violations? Which borough paid the most in fines?
  3. What does the distribution of violations look like over other variables in the dataset borough, date, vehicle type)?
  4. Are the number of parking violations related to fire hydrants the same across the five boroughs? Can we find the hydrants that seem to be ticketed the most commonly?

 

Data Processing and Cleaning

We used three datasets from NYC OpenData to construct our main dataset for analysis and also utilized a few other datasets (both available online and scraped from the internet) to supplement our analysis

  • Open Parking and Camera Violations (dataset)

The Open Parking and Camera Violations dataset hosted online contains data on 73.6 million parking violations from 2016 to 2021. We filtered the dataset on the website page to remove redundant information and downloaded the dataset after filtering, as it was too large to download outright. We stored the dataset in a Google Drive folder, as it exceeded the size limits for GitHub. We are interested in parking violation in five boroughs, Manhattan, Kings, The Bronx, State Island, and Queens. After importing the dataset into R and tidying the dataset, we were ultimately interested in the following variables:

  • summons_number: unique violation identifier

  • borough: borough violation was issued in

  • issue_date (also split into month, date, year): date violation was issued on

  • weekday: day of the week violation was issued

  • hour and min: time violation was issued

  • Parking Violations Issued - Fiscal Year 2021 (dataset)

The Parking Violations Issued - Fiscal Year 2021 dataset provides information about the violations issued during the respective fiscal year. This dataset provides us with approximate locations where each parking violation was issued. In addition, this dataset also provides us with information about the car type. We mostly made use of the following variables in the dataset:

  • summons_number: unique violation identifier, used to join dataset with the Open Parking dataset

  • geo_nyc_address: combination of a few address components in the raw data, passed into NYCGeoSearch in mapping analysis

  • vehicle_body_type: type of vehicle the violation was issued to

  • vehicle_year: year of the vehicle

  • Hydrants in NYC (dataset)

This dataset contains the locations of all fire hydrants in NYC. After tidying the dataset and recoding some columns, we are most interested in the following variables:

  • borough: borough the hydrant is located in
  • unitid: unique hydrant identifier
  • lat and long: latitude and longitude coordinates of the hydrant

Exploratory analysis

In order to have a better understanding of the datasets and trends of parking violations in the city may follow, we conducted some exploration mainly through visualization.

We first tried to explore the overall distribution of the parking violation frequency. Then, we tried to investigate how the number of parking violations distributed over month, week, time of the day and vehicle type. Finally, we looked at the distribution of fine amounts by boroughs.

Violation frequency

Distribution of parking violation

We first Distribution of number of parking violation in five boroughs (Manhattan, Kings, Bronx, Queens, and Staten Island). This bar graph shows the frequency of tickets in five boroughs in 2021. Brooklyn has the greatest number of tickets among five boroughs. Staten Island has the lowest number of tickets in 2021.

violation1 %>% 
  count(borough) %>% 
  mutate(
    borough = fct_reorder(borough, n)) %>% 
  plot_ly(x = ~borough, y = ~n, color = ~borough, type = "bar", colors = "viridis") %>% 
  layout(title = "Frequency of Tickets in Boroughs in 2021",
         xaxis = list(title = "Borough"),
         yaxis = list(title = "Number of Tickets"))

The bar graph further break down the number of tickets in five borough from January to October in 2021. There is sharp increasing of number of tickets in October. The distribution of number of tickets are similar from January to October across Borough. Brooklyn has the highest number of tickets, while State Island has the least number of tickets. There is large increase in number of tickets in October in all boroughs except Staten Island.

violation1 %>% 
  group_by(borough) %>%
  count(month) %>% 
  mutate(month = month.abb[as.numeric(month)],
         month = fct_relevel(month, c("Jan", "Feb", "Mar", "Apr","May","Jun", "Jul", "Aug", "Sep", "Oct"))) %>% 
      plot_ly(x = ~month, y = ~n, color = ~borough, type = "bar", colors = "viridis") %>%
  layout(title = "Frequency of Tickets in Boroughs in 2021",
         xaxis = list(title = "Borough"),
         yaxis = list(title = "Number of Tickets in Each Month in 2021"))
Top 10 violations in NYC in 2021

We want to know the most common ten violation types based on the total frequency in NYC in 2021.

total_violation = violation1 %>% 
  count(violation) %>% 
  mutate(
    violation = fct_reorder(violation, n),
    ranking = min_rank(desc(n))
  ) %>% 
  filter(ranking <= 10) %>% 
  arrange(n)

total_violation %>% 
  plot_ly(x = ~violation, y = ~n, type = "bar") %>% 
  layout(title = "10 Most Common Violations in Total in NYC in 2021",
         xaxis = list(title = "Violation"),
         yaxis = list(title = "Number of Tickets in Total"))
5 common violation types in five boroughs

We want to find the top five common violation types in each borough, and see whether there exists difference among boroughs. We found that the most common 5 reasons of getting a violation tickets in NYC among all five boroughs are: 1) No parking-street cleaning, 2) Failure to display a muni-meter receipt, 3) Inspection sticker-expired/missing, 4) Registration sticker-expired /missing, and 5) Fire hydrant.

## top 5 reasons of violation tickets
violation_type = violation1 %>% 
  group_by(borough) %>% 
  count(violation) %>% 
  mutate(
    violation = fct_reorder(violation, n),
    index = min_rank(desc(n))) %>% 
  filter(index <= 5) 

Bronx

bronx_vio_type = 
  violation_type %>% 
  filter(borough == "Bronx") %>% 
  ggplot(aes(x = violation, y = n, fill = violation)) + 
  geom_bar(stat = "identity") + 
  labs(
    title = "Frequency of Violation Type in Bronx in 2021", 
    x = "Violation Type", 
    y = "Number of tickets") +
  theme(
    axis.text.x = element_text(angle = 60, vjust = .5, hjust = 1),
    legend.text = element_text(size = 5)
  )

ggplotly(bronx_vio_type)

Brooklyn

brooklyn_vio_type = 
  violation_type %>% 
  filter(borough == "Brooklyn") %>% 
  ggplot(aes(x = violation, y = n, fill = violation)) + 
  geom_bar(stat = "identity") + 
  labs(
    title = "Frequency of Violation Type in Brooklyn in 2021", 
    x = "Violation Type", 
    y = "Number of tickets") +
  theme(
    axis.text.x = element_text(angle = 60, vjust = .5, hjust = 1),
    legend.text = element_text(size = 5)
  )

ggplotly(brooklyn_vio_type)

Manhattan

manhattan_vio_type = 
  violation_type %>% 
  filter(borough == "Manhattan") %>% 
  ggplot(aes(x = violation, y = n, fill = violation)) + 
  geom_bar(stat = "identity") + 
  labs(
    title = "Frequency of Violation Type in Manhattan in 2021", 
    x = "Violation Type", 
    y = "Number of tickets") +
  theme(
    axis.text.x = element_text(angle = 60, vjust = .5, hjust = 1),
    legend.text = element_text(size = 5)
  )

ggplotly(manhattan_vio_type)

Queens

queens_vio_type = 
  violation_type %>% 
  filter(borough == "Queens") %>% 
  ggplot(aes(x = violation, y = n, fill = violation)) + 
  geom_bar(stat = "identity") + 
  labs(
    title = "Frequency of Violation Type in Queens in 2021", 
    x = "Violation Type", 
    y = "Number of tickets") +
  theme(
    axis.text.x = element_text(angle = 60, vjust = .5, hjust = 1),
    legend.text = element_text(size = 5)
  )

ggplotly(queens_vio_type)

Staten Island

si_vio_type = 
  violation_type %>% 
  filter(borough == "Staten Island") %>% 
  ggplot(aes(x = violation, y = n, fill = violation)) + 
  geom_bar(stat = "identity") + 
  labs(
    title = "Frequency of Violation Type in Staten Island in 2021", 
    x = "Violation Type", 
    y = "Number of tickets") +
  theme(
    axis.text.x = element_text(angle = 60, vjust = .5, hjust = 1),
    legend.text = element_text(size = 5)
  )

ggplotly(si_vio_type)

Violation Frequency

Violation timeline

For this part of data exploratory, we want to use visualization over time to find out some pattern of parking violation frequency in NY in 2021. We want to know how month, weekday and specific time of the day are associated with the parking violation counts.

Violation frequency vs. month for each borough
violation %>%
  mutate(month = month.name[as.numeric(month)]) %>%
  mutate(month = as.factor(month),
         month = fct_relevel(month, "January", "February", "March", "April", "May", "June", "July", "August", "September", "October")) %>%
  relocate(month) %>%
  group_by(borough, month) %>%
  summarize(n_obs = n()) %>%
  ggplot(aes(x = month, y = n_obs, group = 1, color = borough)) + 
  geom_point() + geom_line() + 
  theme(axis.text.x = element_text(angle = 60, hjust = 1, size = 5) ) +
  labs(title = "The observation of parking violations of each month \n in 2021 for each borough") + facet_wrap(~ borough, nrow = 1)

The above plot shows the total number of parking violation within each month for different boroughs. From the plot we can see that all boroughs have a similar pattern, the total violation counts in February is lower, and the total violation counts in October is higher comparing to other months. We can also see that total violation count in Staten Island is much more lower than the other four boroughs.

Violation frequency vs. day for each month
violation %>%
  mutate(month = month.name[as.numeric(month)]) %>%
  mutate(month = as.factor(month),
         month = fct_relevel(month, "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December")) %>%
  relocate(month) %>%
  group_by(month, day) %>%
  summarize(n_obs = n()) %>%
  ggplot(aes(x = day, y = n_obs, group = 1, color = month)) + 
  geom_point(size = 0.5) + geom_line() + 
  theme(axis.text.x = element_text(angle = 60, hjust = 1) ) +
  labs(title = "The observation of parking violations of day in each \n month in 2021") + facet_wrap(~ month, nrow = 3)

After visualizing the monthly pattern, we also want to know if there is any pattern within each month. The above plot shows the total daily parking violation counts for each month, and we can observe a weekly pattern here.

Violation frequency vs. weekday for each borough
violation %>% 
  mutate(weekday = as.factor(weekday)) %>%
  mutate(weekday = fct_relevel(weekday, "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")) %>%
  relocate(weekday) %>%
  group_by(borough, weekday) %>%
  summarize(n_obs = n()) %>%
  ggplot(aes(x = weekday, y = n_obs, group = 1, color = borough)) + 
  geom_point() + geom_line() + 
  theme(axis.text.x = element_text(angle = 60, hjust = 1) ) + 
  facet_wrap(~ borough, nrow = 1) +
  labs(title = "The observation of parking violations vs. weekday in each \n borough in 2021")

In order to explore more of the weekly parking violation pattern, we plot the violation counts of each weekday for every borough. We can see that all boroughs have a similar pattern, where Saturday has the lowest violation counts and Friday had the second lowest counts comparing to all other weekdays.

Violation frequency vs.hour of the day for each borough
violation %>%
  mutate(hour = as.numeric(hour)) %>%
  group_by(borough, hour) %>%
  summarize(n_obs = n()) %>%
  filter(hour <= 24) %>%
  ggplot(aes(x = hour, y = n_obs, color = borough)) + geom_point() + geom_line() +
  labs(title = "The total number of violations during 24 hours \n of a day in 2021 for each borough") +
  facet_wrap(~ borough, nrow = 2) 

Furthermore, we also try to explore the 24-hour violation pattern within a day. From the above plot, all boroughs also show a similar daily pattern. The number of violations tends to increase from 6:00 AM to 8:00 AM, remain high in 8:00 AM to 2:00 PM, decrease from 2:00PM to 7:00 PM, and remain low from 7:00 PM to 5:00 AM.

Violation Timeline

Violations by Vehicle Body Type

The following plot describes the frequency of violation by vehicle body types. There are mainly top 5 types of vehicle body at the top 5 ranking(unknown type excluded), which are SDN, PICK, 2DSD, 4DSD, SUBN. The type abbreviations are the exact match of the body type listed on the vehicle registration. The law defines a suburban(SUBN) as a vehicle that can be used to carry passengers and cargo. Other common body types and codes for vehicles registered in the passenger vehicle class are sedan (SDN), a two-door sedan (2DSD) and a four-door sedan (4DSD), pick-up trucks (PICK).

violation_df = 
  violation %>%
  count(vehicle_body_type) %>% 
  mutate(
    vehicle_body_type = fct_reorder(vehicle_body_type, n) 
  ) %>% 
  filter(
    vehicle_body_type != "NA",
    n > 10000
  ) %>% 
  ggplot(aes(x = vehicle_body_type, y = n, fill = vehicle_body_type)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Frequency of Violation by Vehicle Body Type",
    x = "Vehicle Body Types",
    y = "Numbers"
  ) +
  geom_text(aes(label = n), position = position_dodge(width = 0.9), vjust = -0.25)
violation_df

Violation by Vehicle Type

Fine amount

Mean Fine Amount Across Boroughs

First, let us compute the average fine amounts across the different boroughs:

violation %>%
  group_by(borough) %>%
  summarise(
    average_fine = mean(fine_amount)
  ) %>%
  knitr::kable(caption = "Average fine amounts across the different boroughs") 
Average fine amounts across the different boroughs
borough average_fine
Bronx 76.46053
Brooklyn 74.34834
Manhattan 82.25287
Queens 68.42849
Staten Island 66.92960

We observe that the average fine amount is highest in Manhattan and lowest in Staten Island.

violation %>% group_by(borough) %>% 
  ggplot(aes(x = fine_amount, fill = borough)) + 
  geom_boxplot(alpha = 0.3) +
  labs(
    x = "Fine Amount",
    fill = "Borough",
    title = "Fine Amount Across Boroughs")

To see if fine amounts do differ significantly across boroughs, we will conduct a simple one-way ANOVA on these two variables.

violation_reg <- lm(fine_amount~borough, data = violation)
anova(violation_reg)
## Analysis of Variance Table
## 
## Response: fine_amount
##                Df     Sum Sq  Mean Sq F value    Pr(>F)    
## borough         4   46651474 11662868   16653 < 2.2e-16 ***
## Residuals 1733700 1214179375      700                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

There appear to be differences between fine amounts and at least 2 boroughs.

Which borough tends to pay the most in excess fees?

Now, we will investigate tendencies in amounts paid in excess of the baseline fine amount.

violation %>%
  select(borough, contains("amount")) %>%
  mutate(
    net_total_amount = fine_amount + penalty_amount + interest_amount - reduction_amount,
    excess_amount = penalty_amount + interest_amount - reduction_amount
  ) %>%
  group_by(borough) %>%
  summarise(
    avg_excess_amount = mean(excess_amount),
    avg_net_total_amount = mean(net_total_amount),
    avg_total_amount = mean(fine_amount)
  ) %>% 
  knitr::kable(caption = "Amounts paid in excess of the baseline fine amount")
Amounts paid in excess of the baseline fine amount
borough avg_excess_amount avg_net_total_amount avg_total_amount
Bronx -5.127610 71.33292 76.46053
Brooklyn -5.593115 68.75523 74.34834
Manhattan -7.304869 74.94800 82.25287
Queens -6.843527 61.58496 68.42849
Staten Island -6.887677 60.04193 66.92960

It seems that parking violation fine adjustments are typically reduced, not increased, if the fine is adjusted.

violation %>%
  select(borough, contains("amount")) %>%
  mutate(
    excess_amount = penalty_amount + interest_amount - reduction_amount
  ) %>%
  filter(excess_amount !=  0) %>%
  
  ggplot(aes(x = excess_amount, fill = borough)) +
  geom_boxplot(alpha = 0.3) +
  labs(
    x = "Fine Adjustment Amount",
    fill = "Borough",
    title = "Fine Adjustments by Borough"
  )

Fine Amount

 

Statistical Analyses

To answer the research questions, we conducted a number of statistical tests to test if the trends we observed were significant.

First, We performed a one-way ANOVA test across the weekdays and months parking violations occurred in to see whether there is significant differences within the groups of time.

ANOVA Test

whether month and weekday are associated with parking violation

As from the data visualization, we observed patterns for the parking violation number over month and over weekday. In order to discuss further about how month and weekday are associated with the frequency of parking violation, we try to use an ANOVA test for weekday and month.

\(H_0\): The mean numbers of parking violation are not different across months and weekdays.

\(H_1\): The mean numbers of parking violation are different across months and weekdays.

fit_df = violation %>%
  mutate(month = as.factor(month)) %>%
  group_by(month, weekday, day) %>%
  summarize(n_obs = n())

fit_model_weekday = lm(n_obs ~ weekday, data = fit_df)
anova(fit_model_weekday) %>% knitr::kable(caption = "One way anova of violation frequency and weekday")
One way anova of violation frequency and weekday
Df Sum Sq Mean Sq F value Pr(>F)
weekday 6 1188117175 198019529 60.08402 0
Residuals 297 978825984 3295710 NA NA
fit_model_month = lm(n_obs ~ month, data = fit_df)
anova(fit_model_month) %>% knitr::kable(caption = "One way anova of violation frequency and month")
One way anova of violation frequency and month
Df Sum Sq Mean Sq F value Pr(>F)
month 9 494640293 54960033 9.662275 0
Residuals 294 1672302865 5688105 NA NA

From the above results, both p-values are very small, so we can say that the parking violation counts are associated with month and weekday.

Chi-square Test

whether the violation types vary among boroughs

Now, we want to know whether there is difference in proportions of ticket amounts in violation types in different boroughs, and we perform a chi-squared test to test the homogeneity of violation types across boroughs.

\(H_0:\) the proportions of tickets of violation type in five boroughs are equal;

\(H_1:\) there exists at least one borough’s proportion of tickets is different.

## 1) No parking-street cleaning, 2) Failure to display a muni-meter receipt, 3) Inspection sticker-expired/missing, 4) Registration sticker-expired /missing, and 5) Fire hydrant. 

five_common_violation = 
  violation1 %>%
  select(borough, violation) %>% 
  filter(violation %in%
           c("NO PARKING-STREET CLEANING",
             "FAIL TO DSPLY MUNI METER RECPT",
             "INSP. STICKER-EXPIRED/MISSING",
             "REG. STICKER-EXPIRED/MISSING",
             "FIRE HYDRANT")) %>%
  count(violation, borough) %>% 
  pivot_wider(
    names_from = "violation",
    values_from = "n"
  ) %>% 
  data.matrix() %>% 
  subset(select = -c(borough))

rownames(five_common_violation) <- c("Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island")

five_common_violation %>% 
  knitr::kable(caption = "Results Table")
Results Table
FAIL TO DSPLY MUNI METER RECPT FIRE HYDRANT INSP. STICKER-EXPIRED/MISSING NO PARKING-STREET CLEANING REG. STICKER-EXPIRED/MISSING
Bronx 24005 32407 44436 46874 35410
Brooklyn 49347 43903 67654 101494 51038
Manhattan 49902 31794 40856 43963 35280
Queens 72767 31610 61256 60778 56589
Staten Island 6025 1946 15857 16 13492
chisq.test(five_common_violation)
## 
##  Pearson's Chi-squared test
## 
## data:  five_common_violation
## X-squared = 53839, df = 16, p-value < 2.2e-16
x_crit = qchisq(0.95, 16)
x_crit
## [1] 26.29623

The result of chi-square shows that \(\chi^2 > \chi_{crit}\), at significant level \(\alpha = 0.05\), so our result is statistically significant and we can conclude that there does exist at least one borough’s proportion of violation amounts is different from others.

Proportion Test

whether proportions of the population receiving fire hydrant parking violations in each borough are equal

Finally, since the population vary from boroughs, we want to standardize the population to assure the accuracy of former test results. We first derived the population of each borough from the most recent census. Next, we performed a proportion test to examine whether the proportion of fire hydrant violations (against the total population in each borough) differs across boroughs.

Besides, we will assume: 1) Each car has only one driver; 2) Each car gets one fire hydrant violation in 2021.

\(H_0:\) The proportions of the population getting tickets of fire hydrant in each borough is equal;

\(H_1:\) At least one proportion of the population getting fire hydrant violation in a borough is different from other boroughs.

url = "https://www.citypopulation.de/en/usa/newyorkcity/"
nyc_population_html = read_html(url)

population = nyc_population_html %>% 
  html_elements(".rname .prio1") %>% 
  html_text()

boro = nyc_population_html %>% 
  html_elements(".rname a span") %>% 
  html_text()

nyc_population = tibble(
  borough = boro,
  population = population %>% str_remove_all(",") %>% as.numeric()
) 
  
fire_hydrant = violation1 %>%
  select(borough, violation, plate) %>% 
  filter(violation == "FIRE HYDRANT") %>%
  distinct(plate, borough) %>% 
  count(borough) 

boro_population = left_join(fire_hydrant, nyc_population)
## Joining, by = "borough"
boro_population %>% 
  knitr::kable(caption = "Results Table")
Results Table
borough n population
Bronx 18432 1472654
Brooklyn 28124 2736074
Manhattan 20809 1694251
Queens 21832 2405464
Staten Island 1611 495747
prop.test(boro_population$n, boro_population$population)
## 
##  5-sample test for equality of proportions without continuity
##  correction
## 
## data:  boro_population$n out of boro_population$population
## X-squared = 4127.7, df = 4, p-value < 2.2e-16
## alternative hypothesis: two.sided
## sample estimates:
##      prop 1      prop 2      prop 3      prop 4      prop 5 
## 0.012516178 0.010278962 0.012282123 0.009076004 0.003249641

From the above results, p-values are small and so we we can say that the proportions of people getting fire hydrant violations are different across boroughs.

 

Mapping and Spatial Analysis

We would like to create some maps of fire hydrant violations by borough. We also want to investigate whether we can find hydrants that are most frequently ticketed.

Geolocation (NYCGeoSearch)

After trying several different geolocating methods, we found that NYCGeoSearch provides the best platform to obtain geographical coordinates (latitude, longitude) from the street addresses we were provided. This platform is advantageous as we can manipulate the search terms by altering the website’s URL. After passing the website into the fromJSON() function from the jsonlite package, we can scrape the page for the data we are interested in (coordinates and borough).

hydrant <- 
  violation %>% 
  filter(violation %in% c("FIRE HYDRANT"), !is.na(geo_nyc_address)) 

pb <- progress_bar$new(total = nrow(hydrant)) # adding progress bar for sanity

get_coord <- function(url) {
  pb$tick()
  
  json_output <- fromJSON(url(url))$features
  
  coord <- json_output$geometry[2]
  borough <-  json_output$properties$borough
  
  out_df <- 
    tibble(
      coordinates = coord$coord,
      mapped_borough = borough
    )
  return(out_df)
}

hydrant_lat_long <- 
  hydrant %>%
  mutate(
  new_url = paste0("https://geosearch.planninglabs.nyc/v1/search?text=", geo_nyc_address, "&size=25"),
  coord = map(new_url, get_coord)
) %>%
  unnest(coord) %>%
  filter(mapped_borough == borough)

The address information provided in our raw data is not very complete and not standardized, as some observations are missing address components (such as house number or street name) and some observations have address components that are not recognizable by the platform (i.e. “30 feet west of Broadway”). As such, the platform produces many possible coordinates for some addresses while not being able to find any coordinates for other observations.

We will work with only the observations that NYCGeoSearch was able to find location coordinates for. We will further restrict this dataset to only include coordinate results that were in the same borough as reported on the citation, and then select just the first coordinate results, if multiple exist.

Finally, we will process the resulting list column through a for loop to extract the latitude and longitudes from each coordinate to their own columns (named lat and long).

hydrant_lat_long <-
  hydrant_lat_long %>%
  group_by(summons_number, geo_nyc_address) %>%
  slice(1)

for (i in 1:nrow(hydrant_lat_long)) {
  hydrant_lat_long$lat[i] <- hydrant_lat_long$coordinates[[i]][2]
  hydrant_lat_long$long[i] <- hydrant_lat_long$coordinates[[i]][1]
}

Running through all previous code chunks up until here takes approximately 7 hours. We will go ahead and save the output from running this code in hydrant_lat_long.csv so that we can read the data from this file into R, instead of running the previous chunks.

write_csv(hydrant_lat_long, "hydrant_lat_long.csv")

We will also load a dataset containing locations of all fire hydrants in NYC, from a file named Hydrants.csv. After tidying and recoding some of the variables, we end up with a dataset where each record corresponds with a single fire hydrant. This dataset will contain the variables borough (borough the fire hydrant is located in), unitid (hydrant identifier), and lat and long (hydrant lat/long coordinates).

hydrant_lat_long <- read_csv("hydrant_lat_long.csv") %>% select(-coordinates)

hydrant_lat_long <- hydrant_lat_long %>% mutate(textlab = paste0("Summons Number: ", summons_number, "\nFine Amount: $", fine_amount))

hydrant_actual <- read_csv("Hydrants.csv")

hydrant_actual <- 
  hydrant_actual %>%
  mutate(borough = case_when(
    BORO == 1 ~ "Manhattan",
    BORO == 2 ~ "Bronx",
    BORO == 3 ~ "Brooklyn",
    BORO == 4 ~ "Queens",
    BORO == 5 ~ "Staten Island"
  )) %>%
  rename(
    lat = LATITUDE,
    long = LONGITUDE,
    unitid = UNITID
  )
Mapping of Fire Hydrant Violation Locations and Fire Hydrants

We will now plot all the points that we have produced to get a sense of how the fire hydrants are distributed across NYC and where fire hydrant violations occurred in the city.

hydrant_violation_plot <- 
  hydrant_lat_long %>% 
  plot_ly(
    lat = ~lat, 
    lon = ~long, 
    type = "scattermapbox", 
    mode = "markers", 
    alpha = 0.2,
    color = ~borough) %>% 
  layout(
    mapbox = list(
      style = 'carto-positron',
      zoom = 9,
      center = list(lon = -73.9, lat = 40.7)))

hydrant_violation_location_plot <- 
  hydrant_violation_plot %>% 
  add_trace(
  data = hydrant_actual %>% mutate(borough = "Fire Hydrant"),
  lat = ~lat,
  lon = ~long,
  type = "scattermapbox",
  mode = "markers",
  alpha = 0.02,
  color = ~borough
  ) %>% 
  layout(
    mapbox = list(
      style = 'carto-positron',
      zoom = 9,
      center = list(lon = -73.9, lat = 40.7)),
    title = "<b> Parking Violation and Hydrant Location </b>",
     legend = list(title = list(text = "Borough of Violation or Hydrant", size = 9),
                    orientation = "h",
                   font = list(size = 9)))

hydrant_violation_location_plot

This map produces points as expected. We see there are more fire hydrant violations in higher-density areas, particularly regions closer to (or in) Manhattan. We see a similar pattern in the distribution of fire hydrants throughout NYC. If we look a little closer at Manhattan, we can observe some distinct regions of higher fire hydrant violations throughout the borough – notably, around SoHo/downtown, Midtown, around the perimeter of Central Park, and around the Upper East Side.

Commonly Ticketed Fire Hydrants

Finally, we want to find the hydrants that are most commonly ticketed.

According to NYC regulations, cars (or, drivers) are ticketed if they park within 15 feet of a hydrant. We will try to associate violation location points with the fire hydrant that caused the violation by constructing hitboxes of approximately 15 feet in radius around each hydrant and see how many violation points are within those boxes. We will double this radius to somewhat adjust for a lack of precision in the violation coordinates we have obtained. Towards this end, we will construct square hitboxes around our hydrant points such that each side of the square is 60 feet wide.

We will use the function destPoint() from the package geosphere to calculate this. This function takes a latitude and longitude coordinate, a bearing (measured from North), and a distance in meters. We will use this function to obtain the coordinates of the vertices of a square centered around the position of the fire hydrant, 35 feet (10 meters) in radius (measured from center to edge).

After we have done so, we will utilize a few functions from the sp package. First, we will use SpatialPolygons (and associated function Polygons) to construct the squares around each hydrant. Then, the function over will tell us which violation location points land within the hydrant hitboxes we have created.

# distance in meters
d = 10
 
square_coord <- function(lat, long, dist = 5){
 
 point_init <- destPoint(c(long, lat), b = 0, d = dist)
 point1 <- destPoint(point_init, b = 270, d = dist) %>% as.data.frame()
 point2 <- destPoint(point1, b = 180, d = dist*2) %>% as.data.frame()
 point3 <- destPoint(point2, b = 90, d = dist*2) %>% as.data.frame()
 point4 <- destPoint(point3, b = 0, d = dist*2) %>% as.data.frame()
 
 sq <- bind_rows(point1, point2, point3, point4, point1)
 
 x <- sq %>% pull(lat)
 y <- sq %>% pull(lon)
 
 out_mat <- cbind(x, y)
 
 return(out_mat)
}

to_poly <- function(polymat, id){
 poly <- Polygons(list(Polygon(polymat)), ID = id)
 return(poly)
}

first_step <- hydrant_actual %>%
mutate(
  sq = map2(.x = lat, .y = long, ~square_coord(lat = .x, long = .y, dist = d)),
  polys = map2(sq, unitid, to_poly)
)  %>% 
select(polys) %>%
pull(polys)

squares <- SpatialPolygons(first_step)

# now, get points ready 

x = hydrant_lat_long %>% pull(lat)
y = hydrant_lat_long %>% pull(long)
xy = cbind(x,y)

dimnames(xy)[[1]] = hydrant_lat_long %>% pull(summons_number)
pts = SpatialPoints(xy)
  
# check if drawn squares have points in them
hits <- over(squares, pts, returnList = TRUE)


hit_hydrants <- tibble(id = names(hits), hits = hits)
hit_hydrants1 <- hit_hydrants %>% unnest(hits)
hit_hydrants2 <- hit_hydrants1 %>% group_by(id) %>% summarise(num_hits = n()) %>% arrange(-num_hits)

isolated_hydrants <- inner_join(hydrant_actual, hit_hydrants2, by = c("unitid" = "id")) 

We see that there are 837 hydrants. We will first reverse geocode the hydrant coordinates to allow the output to be more useful to the human user. We will use the revgeo function from the revgeo package and use Google’s reverse geocode API to get the addresses. Since using Google’s API is not exactly free, we will perform this once and save the addresses in a CSV file.

hydrant_loc <- revgeo(isolated_hydrants %>% pull(long), isolated_hydrants %>% pull(lat), provider = "google", API = "API") # API key has been removed

address <- hydrant_loc %>% unlist()
hydrant_loc <- bind_cols(isolated_hydrants, address = address)

write_csv(hydrant_loc, "isolated_hydrant_loc.csv")

Let’s now create our plot!

isolated_hydrants <- read_csv("isolated_hydrant_loc.csv")

isolated_hydrants_plot <- 
  isolated_hydrants %>%
  mutate(
    borough = "Hydrant",
    textlab = paste0("Number of Associated Violations: ", num_hits, "\nAddress: ", address)
    ) 

hydrant_plot <- 
  hydrant_lat_long %>% 
  plot_ly(
    lat = ~lat, 
    lon = ~long, 
    type = "scattermapbox", 
    mode = "markers", 
    alpha = 0.01,
    color = ~borough,
    text = ~textlab,
    colors = "viridis")

plot <- hydrant_plot %>% add_trace(
  data = isolated_hydrants_plot,
  lat = ~lat,
  lon = ~long,
  type = "scattermapbox",
  mode = "markers",
  size = ~num_hits,
  text = ~textlab,
  color = ~borough,
  marker = list(
    color = "red"
    ),
  alpha = 0.5
  )

plot %>% layout(
    mapbox = list(
      style = 'carto-positron',
      zoom = 9,
      center = list(lon = -73.9, lat = 40.7)),
      title = "<b> Most Frequently Ticketed Hydrants</b>",
     legend = list(title = list(text = "Borough of Violation or Hydrant", size = 9),
                    orientation = "h",
                   font = list(size = 9)))

Interestingly, it seems that the hydrant whose hitbox intersected the most violation points is located in Brooklyn. The following table shows the ten most frequently ticketed hydrants:

hydrants_table <- 
  isolated_hydrants %>% 
  arrange(-num_hits) %>% 
  filter(row_number() <= 10) %>%
  select(unitid, address, num_hits) %>%
  rename(
    c("Hydrant ID" = "unitid", "Hydrant Address" = "address", "Number of Tickets" = "num_hits")
  )
  hydrants_table %>% knitr::kable()
Hydrant ID Hydrant Address Number of Tickets
H301940 2 Atlantic Ave., Pier 7, Brooklyn, NY 11201, USA 40
H202171 522 Morris Ave, Bronx, NY 10451, USA 34
H311717 150 Graham Ave, Brooklyn, NY 11206, USA 26
H107599 25 James St, New York, NY 10038, USA 21
H420999 197-50A Peck Ave, Fresh Meadows, NY 11365, USA 16
H205952 530 E 144th St, Bronx, NY 10454, USA 15
H203269 1468 College Ave, Bronx, NY 10457, USA 15
H101773 Minetta Green, S/e Corner Minetta Lane and, 6th Ave, New York, NY 10012, USA 14
H332046 8 Old Fulton St, Brooklyn, NY 11201, USA 14
H326869 11 Schenck Ct, Brooklyn, NY 11207, USA 13

It is important to note that the above spatial analysis will only be as precise as the geocoded violation coordinates will be.

 

Discussion

EDA Findings:

  • Brooklyn has the highest number of tickets among five boroughs. Manhattan and Queens have approximately the same number of tickets.

  • There is a sharp increase in number of tickets in October in all boroughs.

  • Top three types of violation: No Parking-Street Cleaning, Inspection Sticker Expired/Missing, Fail to Display a muni-meter receipt.

  • Saturday has the lowest violation counts and Friday had the second lowest counts comparing to all other weekdays.

  • Violation increases from 6:00 AM to 8:00 AM, remains high in 8:00 AM to 2:00 PM, decrease from 2:00 PM to 7:00 PM.

  • Average fine amount is highest in Manhattan and lowest in Staten Island. There is significant difference between fine amounts and at least 2 boroughs.

  • Suburban(SUBN) is the vehicle type that has the highest number of violation.

Statistical Analysis Findings:

  • ANOVA test: the parking violation counts are associated with month and weekday.

  • Chi-square test: the proportions of violation amounts is different among boroughs.

  • Proportion test: the proportions of people getting fire hydrant violations are different across boroughs.